CN116883681B - Domain generalization target detection method based on a generative adversarial network - Google Patents

Domain generalization target detection method based on a generative adversarial network

Info

Publication number
CN116883681B
CN116883681B
Authority
CN
China
Prior art keywords
domain
network
target
fpn
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310999356.XA
Other languages
Chinese (zh)
Other versions
CN116883681A (en)
Inventor
张弘
周炫锋
杨一帆
李亚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202310999356.XA priority Critical patent/CN116883681B/en
Publication of CN116883681A publication Critical patent/CN116883681A/en
Application granted granted Critical
Publication of CN116883681B publication Critical patent/CN116883681B/en

Classifications

    • G06V 10/40: Extraction of image or video features
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/094: Adversarial learning
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 2201/07: Target detection (indexing scheme)

Abstract

The invention discloses a domain generalization target detection method based on a generative adversarial network. First, a feature extraction network is constructed to extract features from an input image; a domain excitation attention module is constructed to further process the features extracted by the feature extraction network, improving their domain generalization capability; a feature pyramid network FPN is constructed to perform multi-scale fusion of the extracted features; a generative adversarial network regularization module is constructed to align the features extracted by the FPN with a standard Gaussian distribution, avoiding overfitting of the FPN; a detection head network is constructed to predict the position, category and center position of detection targets; and a target center alignment module is constructed to perform adversarial training on the FPN features, further improving their domain generalization capability. The network structure adopted by the invention is reasonably designed, can overcome problems such as the weak generalization capability of features extracted by existing target detection methods, and enhances the robustness of target detection.

Description

Domain generalization target detection method based on a generative adversarial network
Technical Field
The invention relates to the field of pattern recognition, in particular to a domain generalization target detection method based on a generative adversarial network.
Background
Object detection, the task of locating specific objects in an image, is a fundamental problem in computer vision. In recent years, driven by the development of deep convolutional neural networks (CNNs), the performance of CNN-based target detection methods has improved remarkably.
Target detection studies can be categorized into anchor-based and anchor-free detectors. An anchor-based detector generates target proposals by means of a set of anchors and formulates target detection as a series of classification tasks over the proposals. Faster R-CNN is a pioneering anchor-based detector in which a Region Proposal Network (RPN) is used for proposal generation. Due to its effectiveness, the RPN is widely used in many anchor-based detectors. Anchor-free detectors skip proposal generation and locate objects directly based on a fully convolutional network (FCN). Recently, anchor-free methods have utilized key points, i.e., the center or corners of the box, for localization, and achieve performance comparable to anchor-based methods. However, these methods require complex post-processing to group the detected points. To avoid such processing, FCOS proposes pixel-by-pixel prediction, which directly predicts the class and offsets of the object corresponding to each position on the feature map. In this work, the properties of the anchor-free method are exploited to identify the discriminative regions for the alignment process.
However, in many other scenarios, there is a distribution shift between the data used to train the CNN and the data on which target detection is actually performed; the former is referred to as the source domain and the latter as the target domain. No data from the target domain may be available during CNN training, yet an accurate model is still required for the "unseen" target domain. Each type of background or viewpoint may be considered a domain herein. Due to the distribution shift between the training source domains and the unknown test target domain, detectors trained on reference datasets do not always achieve satisfactory detection results when applied to new scenarios. To overcome the influence of the distribution shift, domain adaptation (DA) and domain generalization (DG) methods have been proposed to improve performance in the target domain. DA methods require target data to train a new model when facing a new target scene, so their performance depends largely on the distribution of the target domain. Furthermore, DA methods are based on the assumption that target domain samples can be obtained in large quantities, which is impractical in some cases. DG methods, on the other hand, can be implemented more conveniently in practice by learning a domain-invariant model without target domain samples. The basic idea of the DG approach is to combine source data in some way to generate a model that is invariant to specific target data, so that the model performs satisfactorily on different target scenarios. However, existing DG approaches degrade when the difference between the source and target domains is large, because a model trained on the source domains may not represent samples from the target domain scene well.
In previous work, the domain generalization problem was mainly addressed in two ways. On the one hand, some methods aggregate information from the source domains to learn a domain-invariant representation; in particular, a domain-invariant transformation is learned by minimizing the distance between domains, i.e., the detector is learned by simply putting together all training data from the different domains. On the other hand, some works train the detector or adjust its weights with all the information from the source domains. However, these methods degrade when the difference between the source scenes and the target domain scene is large. In the present invention, a domain excitation attention block is used to weight input features according to their domain-specific weights. The proposed method is similar to the first line of work but essentially different: the present invention attempts to relate the target domain to the source domains by applying different weights to domain-specific features, and ultimately outputs an adaptive representation applicable to models trained on the source domains.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a domain generalization target detection method based on a generative adversarial network. First, a feature extraction network is constructed to extract input image features; a domain excitation attention module is constructed to further process the features extracted by the feature extraction network, improving their domain generalization capability; a feature pyramid network FPN is constructed to perform multi-scale fusion of the extracted features; a generative adversarial network regularization module is constructed to align the features extracted by the FPN with a standard Gaussian distribution, avoiding overfitting of the FPN; a detection head network is constructed to predict the position, category and center position of detection targets; and a target center alignment module is constructed to perform adversarial training on the FPN features, further improving their domain generalization capability. The method can overcome problems such as the weak generalization capability of features extracted by existing target detection methods, and enhances the robustness of target detection.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a domain generalization target detection method based on an countermeasure generation network comprises the following steps:
step (1) giving annotated images I from K different source domains l L.epsilon. {1, … K }, their true labels y l L epsilon {1, … K } and target domain image I without labels T The goal is to predict annotation y on the target domain image T
Step (2): constructing a domain excitation attention module, which takes the feature map output by the backbone network as input and outputs an enhanced feature map;
Step (3): constructing a backbone network as a feature extractor, which takes an image as input and outputs the extracted features; the domain excitation attention module is inserted before each pooling operation of the backbone network to realize domain enhancement of the features;
Step (4): constructing a generative adversarial network regularization module, which regularizes the features by aligning the input features with a standard Gaussian distribution, thereby avoiding network overfitting;
Step (5): constructing an FPN and fusing features of different scales extracted by the backbone network to realize multi-scale domain feature alignment; a generative adversarial network regularization module is inserted after each scale output of the FPN to improve the generalization of the features; here FPN denotes a feature pyramid network;
Step (6): constructing a detection head network to predict the position, category and center position of detection targets;
Step (7): constructing a target center alignment module, which takes the classification output of the classification head and the target center position as input, integrates them into a domain attention region for focusing on the features output by the FPN, and performs domain adversarial learning on the attention-region features output by the FPN to further improve the domain generalization capability of the features extracted by the FPN.
Further, in step (2), the domain excitation attention module comprises three operations: compression, excitation and classification. The compression operation compresses the input feature map from W×H×C to 1×1×C by a global average pooling operation, where W, H and C denote the width, height and channel dimension of the input feature map, respectively;
the excitation operation passes the 1×1×C feature map output by the compression operation through a fully connected layer and a ReLU (rectified linear unit) activation to generate an intermediate feature F_E; F_E is then passed through another fully connected layer and a ReLU activation to generate a 1×1×C weight map, which is multiplied channel-wise with the W×H×C feature input of the domain excitation attention module;
the classification operation takes the intermediate feature F_E generated by the excitation operation as input and outputs the domain category through a fully connected layer and a SoftMax activation function.
Further, in step (3), the backbone network is the residual neural network ResNet-101.
Further, in step (4), the generative adversarial network regularization module consists of global average pooling, a discrimination network and a standard Gaussian distribution. The global average pooling compresses the input feature map from W×H×C to 1×1×C; the discrimination network is composed of two fully convolutional layers with a Sigmoid (logistic) activation function, takes the globally average-pooled features as input, and judges whether the input features come from the input of the regularization module or are sampled from a standard Gaussian distribution; training the discrimination network improves its discrimination accuracy while driving the features extracted by the FPN toward a standard normal distribution, improving the generalization capability of the features.
Further, in step (5), the FPN uses 5 feature maps of different scales, denoted F_i, i ∈ {3, 4, …, 7}; feature map F_3 corresponds to the smallest-scale targets and feature map F_7 corresponds to the largest-scale targets.
Further, in step (6), the FCOS detection head is used to predict the location, classification and center location of the target.
Further, in step (7), a region map F_obj indicating where objects exist is estimated from the class output map of the detection head network:

F_obj = max_c σ(F_cls)

where F_cls denotes the network class output, σ denotes the Sigmoid activation function, and max_c takes the highest response value across categories at each position as output;
the foreground position estimation map F_CA is further calculated by combining the object center position map: F_obj is combined by element-wise multiplication (⊗) with the object center position map F_ctr output by the network, under a scaling factor β ranging from 0 to 1;
the foreground position estimation map F_CA serves as a region-of-interest estimation map for the FPN output features; F_CA is multiplied channel-wise with the corresponding FPN features to obtain a weighted feature map F_W; after passing through the gradient reversal module GRL, F_W is fed into the domain discrimination network, which outputs the domain category;
the gradient inversion module is composed of a gradient inversion layer R (x), which is defined by the following formula:
R(x)=x
wherein x represents any input feature, and I represents an identity matrix;
the domain discrimination network consists of two layers of convolution layers with the convolution kernel size of 3, the step length of 1, the same input and output dimensions, the convolution layer with the activation function of ReLU and one layer of convolution kernel of 1, the step length of 1, the output dimension of 2 and the convolution layer with the activation function of softMax;
the domain discrimination accuracy is improved by training the discrimination network, meanwhile, the FPN is promoted to pay attention to the domain invariance of the region of interest in the characteristics, and the generalization capability of the FPN for extracting the characteristics is improved.
Compared with the prior art, the invention has the following beneficial effects: the adopted network structure is reasonably designed, extracts features with strong generalization capability, and weakens the influence of domain distribution shift on detection results. Specifically:
(1) The invention provides a domain excitation attention module, which obtains the importance of different channel feature maps in the backbone network for samples from different domains by constructing a new excitation neural network; when a new sample is input into the network, the excitation neural network estimates the similarity of the current sample to each source domain, strengthens the feature channels corresponding to source domains similar to the sample, and suppresses the feature channels corresponding to source domains irrelevant to the sample.
(2) The invention provides a method for regularizing network-extracted features with a generative adversarial network, which drives the extracted features toward a normal distribution, thereby avoiding the network overfitting to the source domains and improving the domain invariance of the network.
(3) The invention provides a target center domain alignment method, which drives the network to enhance the domain generalization of image foreground features, so as to weaken the influence of image-background idiosyncrasies on the network's domain generalization capability.
Drawings
FIG. 1 is a general flow chart of the domain generalization target detection method based on a generative adversarial network of the present invention;
FIG. 2 is a detailed block diagram of the domain excitation attention module;
FIG. 3 is a detailed block diagram of the generative adversarial network regularization module;
FIG. 4 is a graph of the effect of target detection by the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
As shown in FIG. 1, the domain generalization target detection method based on a generative adversarial network of the present invention comprises the following steps:
step (1) giving annotated images I from K different source domains l L.epsilon. {1, … K }, their true notation is y l L epsilon {1, … K } and target domain image I without labels T The goal is to predict target domain image I T Marking y on T
Step (2): a domain excitation attention module is constructed, which takes the feature map output by the backbone network as input and outputs an enhanced feature map.
Step (3): a backbone network is constructed as a feature extractor, which takes the image as input and outputs the extracted features. Domain enhancement of the features is achieved by inserting a domain excitation attention module before each pooling operation of the backbone network.
Step (4): a generative adversarial network regularization module is constructed, which regularizes the features by aligning the input features with a standard Gaussian distribution, thereby avoiding network overfitting.
Step (5): a feature pyramid network (FPN) is constructed, and features of different scales extracted by the backbone network are fused to realize multi-scale domain feature alignment. A generative adversarial network regularization module is inserted after each scale output of the FPN to improve the generalization of the features.
Step (6): a detection head network is constructed to predict the position, category and center position of detection targets.
Step (7): a target center alignment module is constructed, which takes the classification output of the classification head and the target center position as input, integrates them into a domain attention region for focusing on the features output by the FPN, and performs domain adversarial learning on the attention-region features output by the FPN to further improve the domain generalization capability of the features extracted by the FPN.
Further, in step (2), as shown in FIG. 2, the domain excitation attention module comprises three operations: compression, excitation and classification. The compression operation compresses the input feature map from W×H×C to 1×1×C by a global average pooling operation, where W, H and C denote the width, height and channel dimension of the input feature map, respectively.
The excitation operation passes the 1×1×C feature map output by the compression operation through a fully connected layer and a ReLU (rectified linear unit) activation to generate an intermediate feature F_E; F_E is then passed through another fully connected layer and a ReLU activation to generate a 1×1×C weight map, which is multiplied channel-wise with the W×H×C feature input of the domain excitation attention module.
The classification operation takes the intermediate feature F_E generated by the excitation operation as input and outputs the domain category through a fully connected layer and a SoftMax activation function.
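As a concrete illustration, the following is a minimal PyTorch sketch of such a compression/excitation/classification module. The class name, reduction ratio and layer widths are assumptions made for illustration, not the patented implementation; note that, following the description above, the second excitation activation is ReLU rather than the Sigmoid of a standard squeeze-and-excitation block.

    import torch
    import torch.nn as nn

    class DomainExcitationAttention(nn.Module):
        def __init__(self, channels: int, num_domains: int, reduction: int = 16):
            super().__init__()
            # Compression: global average pooling squeezes W x H x C to 1 x 1 x C.
            self.squeeze = nn.AdaptiveAvgPool2d(1)
            # Excitation: FC + ReLU produces the intermediate feature F_E,
            # then FC + ReLU produces per-channel weights (ReLU per the text,
            # not the Sigmoid of a standard SE block).
            self.fc1 = nn.Linear(channels, channels // reduction)
            self.fc2 = nn.Linear(channels // reduction, channels)
            self.relu = nn.ReLU(inplace=True)
            # Classification: F_E -> FC; SoftMax is applied by the loss (L_atten).
            self.domain_head = nn.Linear(channels // reduction, num_domains)

        def forward(self, x: torch.Tensor):
            b, c, _, _ = x.shape
            f_e = self.relu(self.fc1(self.squeeze(x).view(b, c)))  # intermediate feature F_E
            weights = self.relu(self.fc2(f_e)).view(b, c, 1, 1)    # channel weights
            domain_logits = self.domain_head(f_e)                  # domain category logits
            return x * weights, domain_logits                      # channel-wise re-weighting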
In step (3), the backbone network is the residual neural network ResNet-101.
In step (4), as shown in FIG. 3, the generative adversarial network regularization module consists of global average pooling, a discrimination network and a standard Gaussian distribution. The global average pooling compresses the input feature map from W×H×C to 1×1×C. The discrimination network is composed of two fully convolutional layers with a Sigmoid (logistic) activation function; it takes the globally average-pooled features as input and judges whether the input features come from the input of the module or are sampled from a standard Gaussian distribution. Training the discrimination network improves its discrimination accuracy while driving the features extracted by the FPN toward a standard normal distribution, improving the generalization capability of the features.
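A sketch of this regularization module, under stated assumptions: the pooled FPN features are played off against samples drawn from a standard Gaussian through a small fully convolutional discriminator with Sigmoid activations. The hidden width, class name and the use of 1×1 convolutions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class AdvRegularizer(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)  # W x H x C -> 1 x 1 x C
            # Two 1x1 convolutions with Sigmoid activations act as the
            # fully convolutional discrimination network D.
            self.disc = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid(),
                nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid(),
            )

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            fake = self.pool(feat)            # pooled FPN features G(x)
            real = torch.randn_like(fake)     # samples h ~ q(h), standard Gaussian
            d_real, d_fake = self.disc(real), self.disc(fake)
            # Value of L_regular = E_{h~q(h)} log D(h) + E_{x~p(x)} log(1 - D(G(x))).
            # The discriminator is trained to maximize it, while the feature
            # extractor is driven to minimize it (e.g., via alternating updates).
            return (torch.log(d_real + 1e-8) + torch.log(1 - d_fake + 1e-8)).mean()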
In step (5), the FPN uses 5 feature maps of different scales, denoted F_i, i ∈ {3, 4, …, 7}. Feature map F_3 corresponds to the smallest-scale targets and feature map F_7 corresponds to the largest-scale targets.
In step (6), the FCOS detection head is used to predict the target position, classification and target center position.
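For illustration, a minimal PyTorch sketch of an FCOS-style head with per-position classification, box regression (left/top/right/bottom distances) and center-ness outputs follows; the tower depth, channel widths and which tower carries the center-ness branch are assumptions, not the patented configuration.

    import torch
    import torch.nn as nn

    class FCOSHead(nn.Module):
        def __init__(self, channels: int, num_classes: int):
            super().__init__()
            def tower():
                return nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                )
            self.cls_tower, self.box_tower = tower(), tower()
            self.cls_head = nn.Conv2d(channels, num_classes, 3, padding=1)  # class map F_cls
            self.box_head = nn.Conv2d(channels, 4, 3, padding=1)            # (l, t, r, b) offsets
            self.ctr_head = nn.Conv2d(channels, 1, 3, padding=1)            # center map F_ctr

        def forward(self, feat: torch.Tensor):
            c, b = self.cls_tower(feat), self.box_tower(feat)
            return self.cls_head(c), self.box_head(b), self.ctr_head(b)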
In step (7), a region map F_obj indicating where objects exist can be estimated from the class output map of the detection head:

F_obj = max_c σ(F_cls)

where F_cls denotes the network class output, σ denotes the Sigmoid activation function, and max_c takes the highest response value across categories at each position as output.
The foreground position estimation map F_CA can be further calculated by combining the object center position map: F_obj is combined by element-wise multiplication (⊗) with the object center position map F_ctr output by the network, under a scaling factor β ranging from 0 to 1.
The foreground position estimation map F_CA serves as a region-of-interest estimation map for the FPN output features; F_CA is multiplied channel-wise with the corresponding FPN features to obtain a weighted feature map F_W. After passing through the gradient reversal module GRL, F_W is fed into the domain discrimination network, which outputs the domain category.
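A sketch of this weighting under stated assumptions: only the max-over-categories Sigmoid response and the element-wise combination of F_obj with the center map under the factor β are taken from the description above; the exact fusion formula and the function name are hypothetical.

    import torch

    def center_aligned_weighting(f_cls: torch.Tensor,  # (B, num_classes, H, W) class output map
                                 f_ctr: torch.Tensor,  # (B, 1, H, W) object center position map
                                 feat: torch.Tensor,   # (B, C, H, W) FPN feature map
                                 beta: float = 0.5) -> torch.Tensor:
        # F_obj: highest Sigmoid response across categories at each position.
        f_obj = torch.sigmoid(f_cls).max(dim=1, keepdim=True).values
        # F_CA: foreground estimate combining F_obj with the center map;
        # this particular fusion form is an assumption, not from the patent.
        f_ca = f_obj * (beta * torch.sigmoid(f_ctr) + (1.0 - beta))
        # F_W: channel-wise re-weighting of the FPN features by F_CA.
        return feat * f_ca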
The gradient reversal module consists of a gradient reversal layer R(x), defined as:

R(x) = x (forward pass)
dR(x)/dx = -I (backward pass)

where x denotes any input feature and I denotes the identity matrix.
The domain discrimination network consists of two convolution layers with kernel size 3, stride 1, identical input and output dimensions and ReLU activation, followed by one convolution layer with kernel size 1, stride 1, output dimension 2 and SoftMax activation; training the discrimination network improves the domain discrimination accuracy while encouraging the FPN to preserve the domain invariance of the region of interest in the features, improving the generalization capability of the features extracted by the FPN.
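The following is a minimal sketch of the gradient reversal layer (identity on the forward pass, negated gradient on the backward pass) together with a domain discrimination network matching the layer description above; the padding choice (to keep spatial dimensions unchanged) and all names are assumptions.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x                    # R(x) = x
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output         # dR/dx = -I: the gradient is negated

    class DomainDiscriminator(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.net = nn.Sequential(
                # Two 3x3, stride-1 convolutions with unchanged channel width and ReLU.
                nn.Conv2d(channels, channels, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                # One 1x1, stride-1 convolution with output dimension 2, then SoftMax.
                nn.Conv2d(channels, 2, 1, stride=1),
                nn.Softmax(dim=1),
            )

        def forward(self, f_w: torch.Tensor) -> torch.Tensor:
            # F_W passes through the GRL before domain classification.
            return self.net(GradReverse.apply(f_w))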
In combination with the above steps, the specific formulas of the present invention include:
(1) To train the parameters of the domain excitation attention module in step (2), a domain excitation attention loss function L_atten is proposed:

L_atten = -(1/N) · Σ_i Σ_d y_id · log(p_id)

where N denotes the number of input samples, M denotes the number of source-domain categories, y_id denotes the ground-truth domain label of sample i for domain d, and p_id denotes the domain category output by the excitation neural network.
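As a sketch, the loss above is the standard multi-class cross-entropy and can be computed as follows; the assumption here is that p_id is produced by applying SoftMax to the domain head logits of the excitation network.

    import torch
    import torch.nn.functional as F

    def attention_loss(domain_logits: torch.Tensor,  # (N, M) domain head outputs, pre-SoftMax
                       domain_labels: torch.Tensor   # (N,) ground-truth source-domain indices
                       ) -> torch.Tensor:
        # L_atten = -(1/N) * sum_i sum_d y_id * log(p_id); cross_entropy applies
        # log-SoftMax internally, matching the SoftMax classification head.
        return F.cross_entropy(domain_logits, domain_labels)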
(2) To train the parameters of the generative adversarial network regularization module in step (4), a distribution regularization loss function L_regular is proposed:

L_regular = E_{h~q(h)} log D(h) + E_{x~p(x)} log(1 - D(G(x)))

where q(h) is the standard normal distribution, p(x) is the distribution of the input data, D denotes the discrimination network, G denotes the feature extraction network, E_{h~q(h)} denotes the expectation over the distribution q(h), and E_{x~p(x)} denotes the expectation over the distribution p(x).
(3) To train the parameters of the target center domain alignment module in step (7), a loss function L_CA of the target center alignment module is proposed:

L_CA = -Σ_(u,v) [ d · log D_CA((F_CA^s ⊗ F_s)_(u,v)) + (1 - d) · log(1 - D_CA((F_CA^t ⊗ F_t)_(u,v))) ]

where d denotes the domain label and D_CA is the domain discriminator; F_CA^s and F_CA^t denote the foreground position estimation maps corresponding to the source and target domains, respectively; F_s and F_t denote the network feature maps corresponding to the source and target domains, respectively; ⊗ denotes element-wise multiplication; and for any feature map A, A_(u,v) denotes the feature at position (u, v) of the feature map.
(4) The overall network loss function L is:

L = L_det + α·L_atten + β·L_regular + γ·L_CA

where α, β and γ are weights balancing the losses, and L_det is the detection loss function:

L_det = L_cls + L_reg + L_ctr

where L_cls denotes the classification head loss, L_reg denotes the regression head loss, and L_ctr denotes the center position loss.
FIG. 4 shows the results of testing the detector of the present invention after training it on source domain datasets. The method improves the generalization of the features extracted during detection, and good detection results can be obtained even for target domain images that do not appear during training.
It is emphasized that the above embodiments are merely preferred embodiments of the present invention and do not limit it in any way; any simple modification, equivalent variation or refinement made to the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (6)

1. A domain generalization target detection method based on a generative adversarial network, characterized by comprising the following steps:
step (1): given annotated images I_l, l ∈ {1, …, K}, from K different source domains, their ground-truth labels y_l, l ∈ {1, …, K}, and an unlabeled target domain image I_T, the goal is to predict the annotation y_T on the target domain image;
Step (2) constructing a domain excitation attention module, wherein the domain excitation attention module inputs a feature map output by a backbone network and outputs the enhanced feature map; the domain excitation attention module comprises three operations of compression, excitation and classification; the compression operation compresses the input feature map from the dimension W×H×C to 1×1×C by adopting a global average pooling operation, wherein W, H, C represents the width, the height and the dimension of the input feature map respectively;
the excitation operation generates a 1×1×c feature map output by the compression operation through the full connection layer and the activation function ReLU, i.e., the modified linear unit E Intermediate feature F of (2) E Generating a 1 multiplied by C feature map through the full connection layer and an activation function ReLU, and multiplying the feature map by W multiplied by H multiplied by C feature input by the domain excitation attention module according to channel weight;
intermediate features F generated by classification operations with excitation operations E As input, outputting domain category through a full connection layer and an activation function, namely a maximum response extraction function SoftMax;
step (3): constructing a backbone network as a feature extractor, which takes an image as input and outputs the extracted features; the domain excitation attention module is inserted before each pooling operation of the backbone network to realize domain enhancement of the features;
step (4): constructing a generative adversarial network regularization module, which regularizes the features by aligning the input features with a standard Gaussian distribution, thereby avoiding network overfitting; the generative adversarial network regularization module consists of global average pooling, a discrimination network and a standard Gaussian distribution; the global average pooling compresses the input feature map from W×H×C to 1×1×C; the discrimination network is composed of two fully convolutional layers with a Sigmoid (logistic) activation function, takes the globally average-pooled features as input, and judges whether the input features come from the input of the regularization module or are sampled from a standard Gaussian distribution; training the discrimination network improves its discrimination accuracy while driving the features extracted by the FPN toward a standard normal distribution, improving the generalization capability of the features;
step (5): constructing an FPN and fusing features of different scales extracted by the backbone network to realize multi-scale domain feature alignment; a generative adversarial network regularization module is inserted after each scale output of the FPN to improve the generalization of the features; here FPN denotes a feature pyramid network;
step (6): constructing a detection head network to predict the position, category and center position of detection targets;
step (7): constructing a target center alignment module, which takes the classification output of the classification head and the target center position as input, integrates them into a domain attention region for focusing on the features output by the FPN, and performs domain adversarial learning on the attention-region features output by the FPN to further improve the domain generalization capability of the features extracted by the FPN.
2. The method for detecting a domain generalization target based on a generative adversarial network according to claim 1, wherein: in step (3), the backbone network is the residual neural network ResNet-101.
3. The method for detecting a domain generalization target based on a generative adversarial network according to claim 1, wherein: in step (5), the FPN uses 5 feature maps of different scales, denoted F_i, i ∈ {3, 4, …, 7}; feature map F_3 corresponds to the smallest-scale targets and feature map F_7 corresponds to the largest-scale targets.
4. The method for detecting a domain generalization target based on a generative adversarial network according to claim 1, wherein: in step (6), the FCOS detection head is used to predict the target position, classification and target center position.
5. The method for detecting a domain generalization target based on a generative adversarial network according to claim 1, wherein: in step (7), a region map F_obj indicating where objects exist is estimated from the class output map of the detection head network:

F_obj = max_c σ(F_cls)

where F_cls denotes the network class output, σ denotes the Sigmoid activation function, and max_c takes the highest response value across categories at each position as output;
the foreground position estimation map F_CA is further calculated by combining the object center position map: F_obj is combined by element-wise multiplication (⊗) with the object center position map F_ctr output by the network, under a scaling factor β ranging from 0 to 1;
the foreground position estimation map F_CA serves as a region-of-interest estimation map for the FPN output features; F_CA is multiplied channel-wise with the corresponding FPN features to obtain a weighted feature map F_W; after passing through the gradient reversal module GRL, F_W is fed into the domain discrimination network, which outputs the domain category;
the gradient reversal module consists of a gradient reversal layer R(x), defined as:

R(x) = x (forward pass)
dR(x)/dx = -I (backward pass)

where x denotes any input feature and I denotes the identity matrix;
the domain discrimination network consists of two convolution layers with kernel size 3, stride 1, identical input and output dimensions and ReLU activation, followed by one convolution layer with kernel size 1, stride 1, output dimension 2 and SoftMax activation;
training the discrimination network improves the domain discrimination accuracy while encouraging the FPN to preserve the domain invariance of the region of interest in the features, improving the generalization capability of the features extracted by the FPN.
6. The method for detecting a domain generalization target based on a generative adversarial network according to claim 1, wherein: to train the parameters of the domain excitation attention module in step (2), a domain excitation attention loss function L_atten is proposed:

L_atten = -(1/N) · Σ_i Σ_d y_id · log(p_id)

where N denotes the number of input samples, M denotes the number of source-domain categories, y_id denotes the ground-truth domain label of sample i for domain d, and p_id denotes the domain category output by the excitation neural network;
to train the parameters of the generative adversarial network regularization module in step (4), a distribution regularization loss function L_regular is proposed:

L_regular = E_{h~q(h)} log D(h) + E_{x~p(x)} log(1 - D(G(x)))

where q(h) is the standard normal distribution, p(x) is the distribution of the input data, D denotes the discrimination network, G denotes the feature extraction network, E_{h~q(h)} denotes the expectation over the distribution q(h), and E_{x~p(x)} denotes the expectation over the distribution p(x);
to train the parameters of the target center domain alignment module in step (7), a loss function L_CA of the target center alignment module is proposed:

L_CA = -Σ_(u,v) [ d · log D_CA((F_CA^s ⊗ F_s)_(u,v)) + (1 - d) · log(1 - D_CA((F_CA^t ⊗ F_t)_(u,v))) ]

where d denotes the domain label and D_CA is the domain discriminator; F_CA^s and F_CA^t denote the foreground position estimation maps corresponding to the source and target domains, respectively; F_s and F_t denote the network feature maps corresponding to the source and target domains, respectively; ⊗ denotes element-wise multiplication; and for any feature map A, A_(u,v) denotes the feature at position (u, v) of the feature map;
the overall network loss function L is:

L = L_det + α·L_atten + β·L_regular + γ·L_CA

where α, β and γ are weights balancing the losses, and L_det is the detection loss function:

L_det = L_cls + L_reg + L_ctr

where L_cls denotes the classification head loss, L_reg denotes the regression head loss, and L_ctr denotes the center position loss.
CN202310999356.XA 2023-08-09 2023-08-09 Domain generalization target detection method based on a generative adversarial network Active CN116883681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310999356.XA CN116883681B (en) 2023-08-09 2023-08-09 Domain generalization target detection method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310999356.XA CN116883681B (en) 2023-08-09 2023-08-09 Domain generalization target detection method based on a generative adversarial network

Publications (2)

Publication Number Publication Date
CN116883681A CN116883681A (en) 2023-10-13
CN116883681B true CN116883681B (en) 2024-01-30

Family

ID=88256935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310999356.XA Active CN116883681B (en) 2023-08-09 2023-08-09 Domain generalization target detection method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN116883681B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning
CN112668594A (en) * 2021-01-26 2021-04-16 华南理工大学 Unsupervised image target detection method based on antagonism domain adaptation
CN112800906A (en) * 2021-01-19 2021-05-14 吉林大学 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN113936143A (en) * 2021-09-10 2022-01-14 北京建筑大学 Image identification generalization method based on attention mechanism and generation countermeasure network
WO2022036777A1 (en) * 2020-08-21 2022-02-24 暨南大学 Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN114596477A (en) * 2022-03-16 2022-06-07 东南大学 Foggy day train fault detection method based on field self-adaption and attention mechanism
CN114692741A (en) * 2022-03-21 2022-07-01 华南理工大学 Generalized face counterfeiting detection method based on domain invariant features
CN116452862A (en) * 2023-03-30 2023-07-18 华南理工大学 Image classification method based on domain generalization learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739078B (en) * 2020-06-15 2022-11-18 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN112308158B (en) * 2020-11-05 2021-09-24 电子科技大学 Multi-source field self-adaptive model and method based on partial feature alignment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022036777A1 (en) * 2020-08-21 2022-02-24 暨南大学 Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning
CN112800906A (en) * 2021-01-19 2021-05-14 吉林大学 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN112668594A (en) * 2021-01-26 2021-04-16 华南理工大学 Unsupervised image target detection method based on antagonism domain adaptation
CN113936143A (en) * 2021-09-10 2022-01-14 北京建筑大学 Image identification generalization method based on attention mechanism and generation countermeasure network
CN114596477A (en) * 2022-03-16 2022-06-07 东南大学 Foggy day train fault detection method based on field self-adaption and attention mechanism
CN114692741A (en) * 2022-03-21 2022-07-01 华南理工大学 Generalized face counterfeiting detection method based on domain invariant features
CN116452862A (en) * 2023-03-30 2023-07-18 华南理工大学 Image classification method based on domain generalization learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A New Adversarial Domain Generalization Network Based on Class Boundary Feature Detection for Bearing Fault Diagnosis; Jingde Li et al.; IEEE Transactions on Instrumentation and Measurement (Volume 71); full text *
Multi-scale feature fusion network based on feature pyramid; Guo Qifan; Liu Lei; Zhang Cheng; Xu Wenjuan; Jing Wenfeng; Chinese Journal of Engineering Mathematics (No. 5); full text *
A survey of person re-identification in weakly supervised scenarios; Qi Lei; Yu Peize; Gao Yang; Journal of Software (No. 9); full text *

Also Published As

Publication number Publication date
CN116883681A (en) 2023-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant